Carlos Bort & Jordi Puigdellivol
2017-02-27
Introduction to data.table package
Basics data.table.
Intermediate data.table.
Advanced data.table and further steps
Data.table novice: Understand the package logic and its posibilities.
Data.table pr0s: Some useful tips.
Code: https://www.dropbox.com/s/wdf45ujyktk30f9/datatable_talk_code.R?dl=0
Understand how to use data.table
Create a practical guide to use it
Improve your daily work
Getting deep on the object class data.table
Benchmark session
The eternal dplyr vs data.table debate
Basic R programming code it is a bit verbose… (A BIT), right? And sometimes you have to wait for it….
Creating columns, filtering, grouping by… That’s why…
data.table()https://github.com/Rdatatable/data.table
Step 1, install the package:
#install.packages("data.table")
library(data.table)What is a data.table?
It is a mutable object, a improved version of a data.frame.
What does this package so exceptional?
How I transform a data.frame into a data.table?
data("diamonds",package = "ggplot2")class(diamonds)## [1] "tbl_df" "tbl" "data.frame"
setDT(diamonds)
class(diamonds)## [1] "data.table" "data.frame"
Better than…
diamonds<-as.data.frame(diamonds)Data.table is a data.frame too!
The examples will be done with diamonds dataset.
Exercices with flights dataset
data("flights",package = "nycflights13" )Sintaxis:
As you saw in the course, understanding data.table is like understanding SQL.
DT[i, j, by].
SQL[where, select, group_by].
Idea: Three posible separations and two comas:
Subset the elements i, by its row number or condition:
# Select the first three columns
diamonds[1:3]## carat cut color clarity depth table price x y z
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
cut=="ideal", diamonds:# All cuts are Ideal :)
diamonds[cut=="Ideal"][1:2]## carat cut color clarity depth table price x y z
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.23 Ideal J VS1 62.8 56 340 3.93 3.90 2.46
# By name
diamonds[["cut"]][1:10]## [1] Ideal Premium Good Premium Good Very Good Very Good
## [8] Very Good Fair Very Good
## Levels: Fair < Good < Very Good < Premium < Ideal
# By number
diamonds[,1][1:10]## carat
## 1: 0.23
## 2: 0.21
## 3: 0.23
## 4: 0.29
## 5: 0.31
## 6: 0.24
## 7: 0.24
## 8: 0.26
## 9: 0.22
## 10: 0.23
# For multiple columns
diamonds[,1:2][1:10]## carat cut
## 1: 0.23 Ideal
## 2: 0.21 Premium
## 3: 0.23 Good
## 4: 0.29 Premium
## 5: 0.31 Good
## 6: 0.24 Very Good
## 7: 0.24 Very Good
## 8: 0.26 Very Good
## 9: 0.22 Fair
## 10: 0.23 Very Good
For all the observations selected on i calculate:
diamonds[,sum(depth)]## [1] 3330763
For all the selected on i do the calculus j, by these different groups:
diamonds[,mean(carat),by=cut]## cut V1
## 1: Ideal 0.7028370
## 2: Premium 0.8919549
## 3: Good 0.8491847
## 4: Very Good 0.8063814
## 5: Fair 1.0461366
# Only for a subset
diamonds[color %in% c("E","H"),mean(carat),by=cut]## cut V1
## 1: Ideal 0.6765489
## 2: Premium 0.8678284
## 3: Good 0.8179511
## 4: Very Good 0.7797940
## 5: Fair 1.0650664
Counts the number of elements per item. Couting it’s a calculus, j separator.
diamonds[,.N,by=cut]## cut N
## 1: Ideal 21551
## 2: Premium 13791
## 3: Good 4906
## 4: Very Good 12082
## 5: Fair 1610
To create a column you do not have to rewrite your object, it is add by reference. LHS, RHS.
# Creating a carat * depth column. One at a time
diamonds[,carat_depth:=carat*depth]## carat cut color clarity depth table price x y z
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## ---
## 53936: 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.50
## 53937: 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
## 53938: 0.70 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
## 53939: 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74
## 53940: 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
## carat_depth
## 1: 14.145
## 2: 12.558
## 3: 13.087
## 4: 18.096
## 5: 19.623
## ---
## 53936: 43.776
## 53937: 45.432
## 53938: 43.960
## 53939: 52.460
## 53940: 46.650
# Or more than one column
diamonds[,`:=`(carat_depth = carat*depth,carat_depth_2 = carat*depth, carat_depth_3 = carat*depth )]## carat cut color clarity depth table price x y z
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## ---
## 53936: 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.50
## 53937: 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
## 53938: 0.70 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
## 53939: 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74
## 53940: 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
## carat_depth carat_depth_2 carat_depth_3
## 1: 14.145 14.145 14.145
## 2: 12.558 12.558 12.558
## 3: 13.087 13.087 13.087
## 4: 18.096 18.096 18.096
## 5: 19.623 19.623 19.623
## ---
## 53936: 43.776 43.776 43.776
## 53937: 45.432 45.432 45.432
## 53938: 43.960 43.960 43.960
## 53939: 52.460 52.460 52.460
## 53940: 46.650 46.650 46.650
# Or more than one column
diamonds[,`:=`(carat_depth = carat*depth,carat_depth_2 = carat*depth, carat_depth_3 = carat*depth )]## carat cut color clarity depth table price x y z
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## ---
## 53936: 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.50
## 53937: 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
## 53938: 0.70 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
## 53939: 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74
## 53940: 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
## carat_depth carat_depth_2 carat_depth_3
## 1: 14.145 14.145 14.145
## 2: 12.558 12.558 12.558
## 3: 13.087 13.087 13.087
## 4: 18.096 18.096 18.096
## 5: 19.623 19.623 19.623
## ---
## 53936: 43.776 43.776 43.776
## 53937: 45.432 45.432 45.432
## 53938: 43.960 43.960 43.960
## 53939: 52.460 52.460 52.460
## 53940: 46.650 46.650 46.650
diamonds[cut=="Fair",carat_depth:=0]## carat cut color clarity depth table price x y z
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## ---
## 53936: 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.50
## 53937: 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
## 53938: 0.70 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
## 53939: 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74
## 53940: 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
## carat_depth carat_depth_2 carat_depth_3
## 1: 14.145 14.145 14.145
## 2: 12.558 12.558 12.558
## 3: 13.087 13.087 13.087
## 4: 18.096 18.096 18.096
## 5: 19.623 19.623 19.623
## ---
## 53936: 43.776 43.776 43.776
## 53937: 45.432 45.432 45.432
## 53938: 43.960 43.960 43.960
## 53939: 52.460 52.460 52.460
## 53940: 46.650 46.650 46.650
diamonds[cut=="Fair",sum(carat_depth)]## [1] 0
Remove inmediatly without taking into account the size of data.table.
# Delete one
diamonds[,carat_depth_3:=NULL]## carat cut color clarity depth table price x y z
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## ---
## 53936: 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.50
## 53937: 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
## 53938: 0.70 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
## 53939: 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74
## 53940: 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
## carat_depth carat_depth_2
## 1: 14.145 14.145
## 2: 12.558 12.558
## 3: 13.087 13.087
## 4: 18.096 18.096
## 5: 19.623 19.623
## ---
## 53936: 43.776 43.776
## 53937: 45.432 45.432
## 53938: 43.960 43.960
## 53939: 52.460 52.460
## 53940: 46.650 46.650
# More than one
my_columns<-c("carat_depth","carat_depth_2")
diamonds[,(my_columns):=NULL]## carat cut color clarity depth table price x y z
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## ---
## 53936: 0.72 Ideal D SI1 60.8 57 2757 5.75 5.76 3.50
## 53937: 0.72 Good D SI1 63.1 55 2757 5.69 5.75 3.61
## 53938: 0.70 Very Good D SI1 62.8 60 2757 5.66 5.68 3.56
## 53939: 0.86 Premium H SI2 61.0 58 2757 6.15 6.12 3.74
## 53940: 0.75 Ideal D SI2 62.2 55 2757 5.83 5.87 3.64
You can creat multiple agregations with list(newcolumn_1=sum(x),newcolumn_2=sd(x),...)
# You can change list() for . in the next example ;)
diamonds[, list(mean_carat=mean(carat),sd_carat=sd(carat)), by=list(cut,color)][1:10]## cut color mean_carat sd_carat
## 1: Ideal E 0.5784012 0.3125406
## 2: Premium E 0.7177450 0.4097847
## 3: Good E 0.7451340 0.3808900
## 4: Premium I 1.1449370 0.6136041
## 5: Good J 1.0995440 0.5371248
## 6: Very Good J 1.1332153 0.5559197
## 7: Very Good I 1.0469518 0.5519840
## 8: Very Good H 0.9159485 0.5029465
## 9: Fair E 0.8566071 0.3645848
## 10: Ideal J 1.0635938 0.5821001
You can combine different data.table instruction by chaining theme with the square brackets.
diamonds[, .(mean_carat=mean(carat),sd_carat=sd(carat)), by=.(cut,color)][order(-cut)][1:10]## cut color mean_carat sd_carat
## 1: Ideal E 0.5784012 0.3125406
## 2: Ideal J 1.0635938 0.5821001
## 3: Ideal I 0.9130291 0.5537277
## 4: Ideal G 0.7007146 0.4106182
## 5: Ideal D 0.5657657 0.2993503
## 6: Ideal F 0.6558285 0.3745245
## 7: Ideal H 0.7995249 0.4868741
## 8: Premium E 0.7177450 0.4097847
## 9: Premium I 1.1449370 0.6136041
## 10: Premium F 0.8270356 0.4201959
Change columns order by reference.
colnames(diamonds)## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
# Reverse the column names by reference
setcolorder(diamonds,rev(colnames(diamonds)))
colnames(diamonds)## [1] "z" "y" "x" "price" "table" "depth" "clarity"
## [8] "color" "cut" "carat"
data("flights",package = "nycflights13" )Flight columns:
year,month,day Date of departure.
dep_time,arr_time Actual departure and arrival times, local tz.
sched_dep_time,sched_arr_time Scheduled departure and arrival times, local tz.
dep_delay,arr_delay Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
hour,minute Time of scheduled departure broken into hour and minutes. carrier Two letter carrier abbreviation. See airlines to get name tailnum Plane tail number flight Flight number.
origin,dest Origin and destination. See airports for additional metadata.
air_time Amount of time spent in the air, in minutes.
distance Distance between airports, in miles.
time_hour Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.
1.- Transform flights into a data.table (setDT())
## [1] "tbl_df" "tbl" "data.frame"
## [1] "data.table" "data.frame"
2.- How many where destinated to Miami?
## [1] 11728
3.- Calculate the total days of air_time (has NA’s). (pssst: lubridate:::seconds_to_period() )
## [1] 34254.59
## [1] "34254d 14H 10M 0S"
4.- Mean difference by origin airport.
## origin V1
## 1: EWR 15.10795
## 2: LGA 10.34688
## 3: JFK 12.11216
5.- Create a new data.table delays_month with mean, sd, and number of flights per month. (Same column names!)
## month mean_delay sd_delay n_flights
## 1: 1 10.036665 36.39031 27004
## 2: 10 6.243988 29.67176 28889
## 3: 11 5.435362 27.58836 27268
## 4: 12 16.576688 41.87681 28135
## 5: 2 10.816843 36.26655 24951
## 6: 3 13.227076 40.13097 28834
## 7: 4 13.938038 42.96626 28330
## 8: 5 12.986859 39.35283 28796
## 9: 6 20.846332 51.45694 28243
## 10: 7 21.727787 51.61608 29425
## 11: 8 12.611040 37.66692 29327
## 12: 9 6.722476 35.61480 27574
6.- Create a new binary column named not_delayed when dep_delay <0 , equals 1. (psssst: ifelse()).
## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 1 517 515 2 830
## 2: 2013 1 1 533 529 4 850
## 3: 2013 1 1 542 540 2 923
## 4: 2013 1 1 544 545 -1 1004
## 5: 2013 1 1 554 600 -6 812
## ---
## 336772: 2013 9 30 NA 1455 NA NA
## 336773: 2013 9 30 NA 2200 NA NA
## 336774: 2013 9 30 NA 1210 NA NA
## 336775: 2013 9 30 NA 1159 NA NA
## 336776: 2013 9 30 NA 840 NA NA
## sched_arr_time arr_delay carrier flight tailnum origin dest
## 1: 819 11 UA 1545 N14228 EWR IAH
## 2: 830 20 UA 1714 N24211 LGA IAH
## 3: 850 33 AA 1141 N619AA JFK MIA
## 4: 1022 -18 B6 725 N804JB JFK BQN
## 5: 837 -25 DL 461 N668DN LGA ATL
## ---
## 336772: 1634 NA 9E 3393 NA JFK DCA
## 336773: 2312 NA 9E 3525 NA LGA SYR
## 336774: 1330 NA MQ 3461 N535MQ LGA BNA
## 336775: 1344 NA MQ 3572 N511MQ LGA CLE
## 336776: 1020 NA MQ 3531 N839MQ LGA RDU
## air_time distance hour minute time_hour not_delayed
## 1: 227 1400 5 15 2013-01-01 05:00:00 0
## 2: 227 1416 5 29 2013-01-01 05:00:00 0
## 3: 160 1089 5 40 2013-01-01 05:00:00 0
## 4: 183 1576 5 45 2013-01-01 05:00:00 1
## 5: 116 762 6 0 2013-01-01 06:00:00 1
## ---
## 336772: NA 213 14 55 2013-09-30 14:00:00 NA
## 336773: NA 198 22 0 2013-09-30 22:00:00 NA
## 336774: NA 764 12 10 2013-09-30 12:00:00 NA
## 336775: NA 419 11 59 2013-09-30 11:00:00 NA
## 336776: NA 431 8 40 2013-09-30 08:00:00 NA
7.- Calculate the mean of not delayed flights by company. Use the not_delayed column and name it as perc_not_delayed! Use the chaining to order by perc_not_delayed. (Add .N column)
## carrier perc_not_delayed N
## 1: HA 0.7573099 342
## 2: US 0.7276707 20536
## 3: OO 0.6896552 32
## 4: AS 0.6432584 714
## 5: AA 0.6332845 32729
## 6: MQ 0.6288996 26397
## 7: DL 0.6208831 48110
## 8: 9E 0.5579927 18460
## 9: B6 0.5529362 54635
## 10: YV 0.5394495 601
## 11: EV 0.5171353 54173
## 12: VX 0.4862600 5162
## 13: UA 0.4712223 58665
## 14: FL 0.4395984 3260
## 15: F9 0.4354839 685
## 16: WN 0.3755690 12275
8.- Load the airlines dataset. With one line of code, filter the airlines dataset with the previous 10 best airlines. (ppssst: airlines[column %in% super_chunk_of_code])
data("airlines",package = "nycflights13" )
setDT(airlines)## carrier name
## 1: 9E Endeavor Air Inc.
## 2: AA American Airlines Inc.
## 3: AS Alaska Airlines Inc.
## 4: B6 JetBlue Airways
## 5: DL Delta Air Lines Inc.
## 6: HA Hawaiian Airlines Inc.
## 7: MQ Envoy Air
## 8: OO SkyWest Airlines Inc.
## 9: US US Airways Inc.
## 10: YV Mesa Airlines Inc.
9.- Use the previous delays_month data.table and plot the mean delays with the sd delays. Try to use ggplot2 and ggrepel (for the labels). (+ info, google)
10.- Replicate this plot.
## month carrier mean_delay
## 1: 1 UA 8.326167
## 2: 1 AA 6.932358
## 3: 1 B6 9.493436
## 4: 1 DL 3.849768
## 5: 1 EV 24.228879
## ---
## 181: 9 9E 7.754232
## 182: 9 HA -5.440000
## 183: 9 F9 8.263158
## 184: 9 YV 8.880952
## 185: 9 OO -4.941176
Inspired by: https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html
Fast add, remove and update subsets of columns, by reference. := operator can be used in two ways: LHS := RHS form, and Functional form.
diamonds[ , `:=` ( avg_price = round( mean( price ), 2), avg_carat = round( mean( carat ), 2) ) , by = cut ]## z y x price table depth clarity color cut carat
## 1: 2.43 3.98 3.95 326 55 61.5 SI2 E Ideal 0.23
## 2: 2.31 3.84 3.89 326 61 59.8 SI1 E Premium 0.21
## 3: 2.31 4.07 4.05 327 65 56.9 VS1 E Good 0.23
## 4: 2.63 4.23 4.20 334 58 62.4 VS2 I Premium 0.29
## 5: 2.75 4.35 4.34 335 58 63.3 SI2 J Good 0.31
## ---
## 53936: 3.50 5.76 5.75 2757 57 60.8 SI1 D Ideal 0.72
## 53937: 3.61 5.75 5.69 2757 55 63.1 SI1 D Good 0.72
## 53938: 3.56 5.68 5.66 2757 60 62.8 SI1 D Very Good 0.70
## 53939: 3.74 6.12 6.15 2757 58 61.0 SI2 H Premium 0.86
## 53940: 3.64 5.87 5.83 2757 55 62.2 SI2 D Ideal 0.75
## avg_price avg_carat
## 1: 3457.54 0.70
## 2: 4584.26 0.89
## 3: 3928.86 0.85
## 4: 4584.26 0.89
## 5: 3928.86 0.85
## ---
## 53936: 3457.54 0.70
## 53937: 3928.86 0.85
## 53938: 3981.76 0.81
## 53939: 4584.26 0.89
## 53940: 3457.54 0.70
vars = c( "price" , "carat" )
new_vars=paste0("avg_",vars)
diamonds[ , c( new_vars, "clarity" ) := list( round( avg_price, 2), round( avg_carat, 2 ) , as.character(clarity) ) ]## z y x price table depth clarity color cut carat
## 1: 2.43 3.98 3.95 326 55 61.5 SI2 E Ideal 0.23
## 2: 2.31 3.84 3.89 326 61 59.8 SI1 E Premium 0.21
## 3: 2.31 4.07 4.05 327 65 56.9 VS1 E Good 0.23
## 4: 2.63 4.23 4.20 334 58 62.4 VS2 I Premium 0.29
## 5: 2.75 4.35 4.34 335 58 63.3 SI2 J Good 0.31
## ---
## 53936: 3.50 5.76 5.75 2757 57 60.8 SI1 D Ideal 0.72
## 53937: 3.61 5.75 5.69 2757 55 63.1 SI1 D Good 0.72
## 53938: 3.56 5.68 5.66 2757 60 62.8 SI1 D Very Good 0.70
## 53939: 3.74 6.12 6.15 2757 58 61.0 SI2 H Premium 0.86
## 53940: 3.64 5.87 5.83 2757 55 62.2 SI2 D Ideal 0.75
## avg_price avg_carat
## 1: 3457.54 0.70
## 2: 4584.26 0.89
## 3: 3928.86 0.85
## 4: 4584.26 0.89
## 5: 3928.86 0.85
## ---
## 53936: 3457.54 0.70
## 53937: 3928.86 0.85
## 53938: 3981.76 0.81
## 53939: 4584.26 0.89
## 53940: 3457.54 0.70
In data.table parlance, all set* functions change their input by reference. That is, no copy is made at all, other than temporary working memory, which is as large as one column.. The only other data.table operator that modifies input by reference is :=.
setkey() sorts a data.table and marks it as sorted (with an attribute sorted). The sorted columns are the key. The key can be any columns in any order. The columns are sorted in ascending order always.
setkey(diamonds,carat)
setkeyv( diamonds, c( "carat", "cut" ) )setnames( diamonds, c("avg_price","avg_carat" ),c("mean_price","mean_carat" ) )Intended for use in i in [.data.table.
diamonds[price %between% c(300,500),.(carat,price,cut)]## carat price cut
## 1: 0.20 367 Very Good
## 2: 0.20 345 Premium
## 3: 0.20 367 Premium
## 4: 0.20 367 Premium
## 5: 0.20 367 Premium
## ---
## 1745: 0.40 491 Good
## 1746: 0.40 484 Ideal
## 1747: 0.41 467 Good
## 1748: 0.43 490 Good
## 1749: 0.43 452 Premium
range = data.table(lower = c(2000,3000), upper = c(2500,3500))
diamonds[price %inrange% range, .(carat,price,cut)]## carat price cut
## 1: 0.30 2366 Very Good
## 2: 0.34 2160 Fair
## 3: 0.34 2287 Ideal
## 4: 0.34 2287 Ideal
## 5: 0.34 2346 Ideal
## ---
## 5501: 1.50 3179 Good
## 5502: 1.50 3457 Premium
## 5503: 1.51 3497 Good
## 5504: 1.52 3105 Good
## 5505: 1.52 3105 Good
.SD, .BY, .N, .I and .GRP are read only symbols for use in j. .N can be used in i as well
.SD is a data.table containing the Subset of x’s Data for each group, excluding any columns used in by (or keyby).
diamonds[,lapply(.SD,floor),.SDcols=c("x","y","z")]## x y z
## 1: 3 3 2
## 2: 3 3 2
## 3: 3 3 2
## 4: 3 3 2
## 5: 3 3 2
## ---
## 53936: 10 10 6
## 53937: 10 9 6
## 53938: 10 9 6
## 53939: 10 10 6
## 53940: 10 10 6
newest_vars=paste0("new_mean_",vars)
diamonds[ , c(newest_vars) := lapply( .SD, function(x){ round( mean( x ) ,2 ) }) , by = cut,.SDcols=vars ]## z y x price table depth clarity color cut carat
## 1: 2.36 3.71 3.74 367 59 63.4 VS2 E Very Good 0.20
## 2: 2.27 3.75 3.79 345 62 60.2 SI2 E Premium 0.20
## 3: 2.26 3.77 3.79 367 62 59.8 VS2 E Premium 0.20
## 4: 2.24 3.78 3.81 367 60 59.0 VS2 E Premium 0.20
## 5: 2.32 3.78 3.81 367 59 61.1 VS2 E Premium 0.20
## ---
## 53936: 6.17 10.10 10.14 15223 61 61.0 I1 I Premium 4.01
## 53937: 6.24 9.94 10.02 15223 62 62.5 I1 J Premium 4.01
## 53938: 6.43 9.85 10.00 17329 61 64.8 I1 H Fair 4.13
## 53939: 6.72 10.16 10.23 18531 58 65.8 I1 J Fair 4.50
## 53940: 6.98 10.54 10.74 18018 59 65.5 I1 J Fair 5.01
## mean_price mean_carat new_mean_price new_mean_carat
## 1: 3981.76 0.81 3981.76 0.81
## 2: 4584.26 0.89 4584.26 0.89
## 3: 4584.26 0.89 4584.26 0.89
## 4: 4584.26 0.89 4584.26 0.89
## 5: 4584.26 0.89 4584.26 0.89
## ---
## 53936: 4584.26 0.89 4584.26 0.89
## 53937: 4584.26 0.89 4584.26 0.89
## 53938: 4358.76 1.05 4358.76 1.05
## 53939: 4358.76 1.05 4358.76 1.05
## 53940: 4358.76 1.05 4358.76 1.05
.BY is a list containing a length 1 vector for each item in by. This can be useful when by is not known in advance. The by variables are also available to j directly by name; useful for example for titles of graphs if j is a plot command, or to branch with if() depending on the value of a group variable.
par(mfrow=c(2,4))
invisible( diamonds[,.(hist(price)),by=clarity] )invisible( diamonds[,.(hist(price,main=paste(.BY))),by=clarity] )par(mfrow=c(1,1)).GRP is an integer, length 1, containing a simple group counter. 1 for the 1st group, 2 for the 2nd, etc.
diamonds[,.GRP,by=clarity]## clarity GRP
## 1: VS2 1
## 2: SI2 2
## 3: SI1 3
## 4: VVS2 4
## 5: VS1 5
## 6: VVS1 6
## 7: IF 7
## 8: I1 8
0.- Rename not_delayed column to i_hate_delays
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour" "i_hate_delays"
1.- Cast all the character filds tolower(). ( Of course in one line ;)
vars = c("carrier","tailnum","origin","dest")## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 1 517 515 2 830
## 2: 2013 1 1 533 529 4 850
## 3: 2013 1 1 542 540 2 923
## 4: 2013 1 1 544 545 -1 1004
## 5: 2013 1 1 554 600 -6 812
## ---
## 336772: 2013 9 30 NA 1455 NA NA
## 336773: 2013 9 30 NA 2200 NA NA
## 336774: 2013 9 30 NA 1210 NA NA
## 336775: 2013 9 30 NA 1159 NA NA
## 336776: 2013 9 30 NA 840 NA NA
## sched_arr_time arr_delay carrier flight tailnum origin dest
## 1: 819 11 ua 1545 n14228 ewr iah
## 2: 830 20 ua 1714 n24211 lga iah
## 3: 850 33 aa 1141 n619aa jfk mia
## 4: 1022 -18 b6 725 n804jb jfk bqn
## 5: 837 -25 dl 461 n668dn lga atl
## ---
## 336772: 1634 NA 9e 3393 NA jfk dca
## 336773: 2312 NA 9e 3525 NA lga syr
## 336774: 1330 NA mq 3461 n535mq lga bna
## 336775: 1344 NA mq 3572 n511mq lga cle
## 336776: 1020 NA mq 3531 n839mq lga rdu
## air_time distance hour minute time_hour not_delayed
## 1: 227 1400 5 15 2013-01-01 05:00:00 0
## 2: 227 1416 5 29 2013-01-01 05:00:00 0
## 3: 160 1089 5 40 2013-01-01 05:00:00 0
## 4: 183 1576 5 45 2013-01-01 05:00:00 1
## 5: 116 762 6 0 2013-01-01 06:00:00 1
## ---
## 336772: NA 213 14 55 2013-09-30 14:00:00 NA
## 336773: NA 198 22 0 2013-09-30 22:00:00 NA
## 336774: NA 764 12 10 2013-09-30 12:00:00 NA
## 336775: NA 419 11 59 2013-09-30 11:00:00 NA
## 336776: NA 431 8 40 2013-09-30 08:00:00 NA
2.- Can you plot arr_delay vs dep_delay for each carrier wich has more than 25000 fligths in the same plot, in one line and… without using ggplot2? (muahahahahahaha)
par(mfrow=c(2,3))## Empty data.table (0 rows) of 1 col: carrier
par(mfrow=c(1,1))Lead or lag vectors, lists, data.frames or data.tables implemented in C for speed.
today=as.character(Sys.Date())
diamonds[,fake_data:= sample(seq(as.Date(today)-.N,as.Date(today),by="day"),.N)]## z y x price table depth clarity color cut carat
## 1: 2.36 3.71 3.74 367 59 63.4 VS2 E Very Good 0.20
## 2: 2.27 3.75 3.79 345 62 60.2 SI2 E Premium 0.20
## 3: 2.26 3.77 3.79 367 62 59.8 VS2 E Premium 0.20
## 4: 2.24 3.78 3.81 367 60 59.0 VS2 E Premium 0.20
## 5: 2.32 3.78 3.81 367 59 61.1 VS2 E Premium 0.20
## ---
## 53936: 6.17 10.10 10.14 15223 61 61.0 I1 I Premium 4.01
## 53937: 6.24 9.94 10.02 15223 62 62.5 I1 J Premium 4.01
## 53938: 6.43 9.85 10.00 17329 61 64.8 I1 H Fair 4.13
## 53939: 6.72 10.16 10.23 18531 58 65.8 I1 J Fair 4.50
## 53940: 6.98 10.54 10.74 18018 59 65.5 I1 J Fair 5.01
## mean_price mean_carat new_mean_price new_mean_carat fake_data
## 1: 3981.76 0.81 3981.76 0.81 1919-09-28
## 2: 4584.26 0.89 4584.26 0.89 1940-09-02
## 3: 4584.26 0.89 4584.26 0.89 1958-01-07
## 4: 4584.26 0.89 4584.26 0.89 1942-05-12
## 5: 4584.26 0.89 4584.26 0.89 1896-12-22
## ---
## 53936: 4584.26 0.89 4584.26 0.89 1878-11-18
## 53937: 4584.26 0.89 4584.26 0.89 1946-03-04
## 53938: 4358.76 1.05 4358.76 1.05 1869-08-06
## 53939: 4358.76 1.05 4358.76 1.05 2009-09-25
## 53940: 4358.76 1.05 4358.76 1.05 1935-06-23
setkey(diamonds,"fake_data")
diamonds[, last_price := shift( price, n = 1 , type = "lag" ) ]## z y x price table depth clarity color cut carat
## 1: 3.22 5.29 5.23 1346 55 61.2 SI1 F Very Good 0.54
## 2: 3.95 6.30 6.33 6154 54 62.5 VS2 F Premium 1.00
## 3: 3.81 6.30 6.22 3691 62 60.9 SI2 D Very Good 0.94
## 4: 4.26 6.88 6.84 5797 56 62.1 SI1 I Ideal 1.24
## 5: 2.70 4.43 4.40 786 57 61.2 VVS2 F Ideal 0.32
## ---
## 53936: 2.68 4.29 4.31 1013 55 62.3 VVS1 G Ideal 0.30
## 53937: 3.61 5.91 5.87 3584 55 61.3 SI1 E Ideal 0.76
## 53938: 3.24 5.25 5.28 1881 56 61.5 VS2 E Premium 0.55
## 53939: 2.50 4.18 4.15 425 56 60.0 VS1 D Very Good 0.26
## 53940: 4.95 8.41 8.45 15198 60 58.7 SI2 D Premium 2.11
## mean_price mean_carat new_mean_price new_mean_carat fake_data
## 1: 3981.76 0.81 3981.76 0.81 1869-06-23
## 2: 4584.26 0.89 4584.26 0.89 1869-06-24
## 3: 3981.76 0.81 3981.76 0.81 1869-06-25
## 4: 3457.54 0.70 3457.54 0.70 1869-06-26
## 5: 3457.54 0.70 3457.54 0.70 1869-06-27
## ---
## 53936: 3457.54 0.70 3457.54 0.70 2017-02-23
## 53937: 3457.54 0.70 3457.54 0.70 2017-02-24
## 53938: 4584.26 0.89 4584.26 0.89 2017-02-25
## 53939: 3981.76 0.81 3981.76 0.81 2017-02-26
## 53940: 4584.26 0.89 4584.26 0.89 2017-02-27
## last_price
## 1: NA
## 2: 1346
## 3: 6154
## 4: 3691
## 5: 5797
## ---
## 53936: 4480
## 53937: 1013
## 53938: 3584
## 53939: 1881
## 53940: 425
diamonds[, next_price := shift( price, n = 1 , type = "lead" ), by = cut ]## z y x price table depth clarity color cut carat
## 1: 3.22 5.29 5.23 1346 55 61.2 SI1 F Very Good 0.54
## 2: 3.95 6.30 6.33 6154 54 62.5 VS2 F Premium 1.00
## 3: 3.81 6.30 6.22 3691 62 60.9 SI2 D Very Good 0.94
## 4: 4.26 6.88 6.84 5797 56 62.1 SI1 I Ideal 1.24
## 5: 2.70 4.43 4.40 786 57 61.2 VVS2 F Ideal 0.32
## ---
## 53936: 2.68 4.29 4.31 1013 55 62.3 VVS1 G Ideal 0.30
## 53937: 3.61 5.91 5.87 3584 55 61.3 SI1 E Ideal 0.76
## 53938: 3.24 5.25 5.28 1881 56 61.5 VS2 E Premium 0.55
## 53939: 2.50 4.18 4.15 425 56 60.0 VS1 D Very Good 0.26
## 53940: 4.95 8.41 8.45 15198 60 58.7 SI2 D Premium 2.11
## mean_price mean_carat new_mean_price new_mean_carat fake_data
## 1: 3981.76 0.81 3981.76 0.81 1869-06-23
## 2: 4584.26 0.89 4584.26 0.89 1869-06-24
## 3: 3981.76 0.81 3981.76 0.81 1869-06-25
## 4: 3457.54 0.70 3457.54 0.70 1869-06-26
## 5: 3457.54 0.70 3457.54 0.70 1869-06-27
## ---
## 53936: 3457.54 0.70 3457.54 0.70 2017-02-23
## 53937: 3457.54 0.70 3457.54 0.70 2017-02-24
## 53938: 4584.26 0.89 4584.26 0.89 2017-02-25
## 53939: 3981.76 0.81 3981.76 0.81 2017-02-26
## 53940: 4584.26 0.89 4584.26 0.89 2017-02-27
## last_price next_price
## 1: NA 3691
## 2: 1346 11127
## 3: 6154 529
## 4: 3691 786
## 5: 5797 2064
## ---
## 53936: 4480 3584
## 53937: 1013 NA
## 53938: 3584 15198
## 53939: 1881 NA
## 53940: 425 NA
Intended for use in i in [.data.table.
diamonds[cut %like% "Good$" ]## z y x price table depth clarity color cut carat
## 1: 3.22 5.29 5.23 1346 55 61.2 SI1 F Very Good 0.54
## 2: 3.81 6.30 6.22 3691 62 60.9 SI2 D Very Good 0.94
## 3: 2.74 4.53 4.47 529 54 60.9 IF J Very Good 0.33
## 4: 4.27 6.68 6.72 6344 52 63.7 SI2 D Good 1.20
## 5: 3.53 5.54 5.61 2198 60 63.3 SI1 F Very Good 0.71
## ---
## 16984: 3.62 5.86 5.82 2422 57 62.0 SI2 G Very Good 0.76
## 16985: 4.55 7.29 7.25 14976 57 62.6 VS1 F Very Good 1.51
## 16986: 3.39 5.85 5.83 2559 61 58.0 VS1 F Good 0.71
## 16987: 4.59 7.32 7.30 12648 56 62.8 VVS2 H Very Good 1.52
## 16988: 2.50 4.18 4.15 425 56 60.0 VS1 D Very Good 0.26
## mean_price mean_carat new_mean_price new_mean_carat fake_data
## 1: 3981.76 0.81 3981.76 0.81 1869-06-23
## 2: 3981.76 0.81 3981.76 0.81 1869-06-25
## 3: 3981.76 0.81 3981.76 0.81 1869-06-28
## 4: 3928.86 0.85 3928.86 0.85 1869-07-08
## 5: 3981.76 0.81 3981.76 0.81 1869-07-11
## ---
## 16984: 3981.76 0.81 3981.76 0.81 2017-02-16
## 16985: 3981.76 0.81 3981.76 0.81 2017-02-17
## 16986: 3928.86 0.85 3928.86 0.85 2017-02-19
## 16987: 3981.76 0.81 3981.76 0.81 2017-02-20
## 16988: 3981.76 0.81 3981.76 0.81 2017-02-26
## last_price next_price
## 1: NA 3691
## 2: 6154 529
## 3: 786 2198
## 4: 5075 882
## 5: 6505 2574
## ---
## 16984: 6727 14976
## 16985: 2422 12648
## 16986: 596 NA
## 16987: 2559 425
## 16988: 1881 NA
Fast merge of two data.tables. The data.table method behaves very similarly to that of data.frames except that, by default, it attempts to merge
- at first based on the shared key columns, and if there are none,
- then based on key columns of the first argument x, and if there are none,
- then based on the common columns between the two data.tables.
X = data.table(x = c("a","a","a","a","b","b","c","d"), some_staf = 1:8, key = "x")
Y = data.table(x = c("a","c","z"), z = c(100,500,1000), key = "x")
# You can do it two steps setkey(X, x)
X[Y]## x some_staf z
## 1: a 1 100
## 2: a 2 100
## 3: a 3 100
## 4: a 4 100
## 5: c 7 500
## 6: z NA 1000
Y[X]## x z some_staf
## 1: a 100 1
## 2: a 100 2
## 3: a 100 3
## 4: a 100 4
## 5: b NA 5
## 6: b NA 6
## 7: c 500 7
## 8: d NA 8
merge(X,Y,all=T)## x some_staf z
## 1: a 1 100
## 2: a 2 100
## 3: a 3 100
## 4: a 4 100
## 5: b 5 NA
## 6: b 6 NA
## 7: c 7 500
## 8: d 8 NA
## 9: z NA 1000
merge(X,Y,all=F)## x some_staf z
## 1: a 1 100
## 2: a 2 100
## 3: a 3 100
## 4: a 4 100
## 5: c 7 500
THE BEST FUNCTION FOR LOADING .CSV FILES
As write.csv but much faster (e.g. 2 seconds versus 1 minute) and just as flexible.
Data.table has optimized function for melting and casting a data.table.
IDateTime
system.time(IDateTime(flights[["time_hour"]]))## user system elapsed
## 0.038 0.012 0.049
nrow(flights)## [1] 336776
#CRYPipe operators for data.table
library(magrittr)
diamonds %>% setDT() %>%
.[, .(mean_carat=mean(carat),sd_carat=sd(carat)), by=.(cut,color)] %>%
.[order(mean_carat)] %>%
.[1:3]## cut color mean_carat sd_carat
## 1: Ideal D 0.5657657 0.2993503
## 2: Ideal E 0.5784012 0.3125406
## 3: Ideal F 0.6558285 0.3745245
https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html
1.- magrittr way! Calculate the mean of not delayed flights by company. Use the not_delayed column and name it as perc_not_delayed! Use the chaining to order by perc_not_delayed. (Add .N column)
## carrier perc_not_delayed N
## 1: ha 0.7573099 342
## 2: us 0.7276707 20536
## 3: oo 0.6896552 32
## 4: as 0.6432584 714
## 5: aa 0.6332845 32729
## 6: mq 0.6288996 26397
## 7: dl 0.6208831 48110
## 8: 9e 0.5579927 18460
## 9: b6 0.5529362 54635
## 10: yv 0.5394495 601
## 11: ev 0.5171353 54173
## 12: vx 0.4862600 5162
## 13: ua 0.4712223 58665
## 14: fl 0.4395984 3260
## 15: f9 0.4354839 685
## 16: wn 0.3755690 12275
2.-for each fligth can you tell wich was the last carrier that did the same travel?
require(lubridate)## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday,
## week, yday, year
## The following object is masked from 'package:base':
##
## date
flights[,dep_time_stamp:=ymd_hm(paste(year,month,day,hour,minute))]## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 1 517 515 2 830
## 2: 2013 1 1 533 529 4 850
## 3: 2013 1 1 542 540 2 923
## 4: 2013 1 1 544 545 -1 1004
## 5: 2013 1 1 554 600 -6 812
## ---
## 336772: 2013 9 30 NA 1455 NA NA
## 336773: 2013 9 30 NA 2200 NA NA
## 336774: 2013 9 30 NA 1210 NA NA
## 336775: 2013 9 30 NA 1159 NA NA
## 336776: 2013 9 30 NA 840 NA NA
## sched_arr_time arr_delay carrier flight tailnum origin dest
## 1: 819 11 ua 1545 n14228 ewr iah
## 2: 830 20 ua 1714 n24211 lga iah
## 3: 850 33 aa 1141 n619aa jfk mia
## 4: 1022 -18 b6 725 n804jb jfk bqn
## 5: 837 -25 dl 461 n668dn lga atl
## ---
## 336772: 1634 NA 9e 3393 NA jfk dca
## 336773: 2312 NA 9e 3525 NA lga syr
## 336774: 1330 NA mq 3461 n535mq lga bna
## 336775: 1344 NA mq 3572 n511mq lga cle
## 336776: 1020 NA mq 3531 n839mq lga rdu
## air_time distance hour minute time_hour not_delayed
## 1: 227 1400 5 15 2013-01-01 05:00:00 0
## 2: 227 1416 5 29 2013-01-01 05:00:00 0
## 3: 160 1089 5 40 2013-01-01 05:00:00 0
## 4: 183 1576 5 45 2013-01-01 05:00:00 1
## 5: 116 762 6 0 2013-01-01 06:00:00 1
## ---
## 336772: NA 213 14 55 2013-09-30 14:00:00 NA
## 336773: NA 198 22 0 2013-09-30 22:00:00 NA
## 336774: NA 764 12 10 2013-09-30 12:00:00 NA
## 336775: NA 419 11 59 2013-09-30 11:00:00 NA
## 336776: NA 431 8 40 2013-09-30 08:00:00 NA
## N dep_time_stamp
## 1: 58665 2013-01-01 05:15:00
## 2: 58665 2013-01-01 05:29:00
## 3: 32729 2013-01-01 05:40:00
## 4: 54635 2013-01-01 05:45:00
## 5: 48110 2013-01-01 06:00:00
## ---
## 336772: 18460 2013-09-30 14:55:00
## 336773: 18460 2013-09-30 22:00:00
## 336774: 26397 2013-09-30 12:10:00
## 336775: 26397 2013-09-30 11:59:00
## 336776: 26397 2013-09-30 08:40:00
## dep_time_stamp origin dest carrier last_carrier
## 1: 2013-01-01 13:17:00 ewr alb ev NA
## 2: 2013-01-01 16:21:00 ewr alb ev ev
## 3: 2013-01-01 20:04:00 ewr alb ev ev
## 4: 2013-01-02 13:27:00 ewr alb ev ev
## 5: 2013-01-02 16:21:00 ewr alb ev ev
## ---
## 336772: 2013-12-27 07:15:00 lga xna mq mq
## 336773: 2013-12-29 15:20:00 lga xna mq mq
## 336774: 2013-12-30 07:15:00 lga xna mq mq
## 336775: 2013-12-30 15:20:00 lga xna mq mq
## 336776: 2013-12-31 07:15:00 lga xna mq mq
3.- Merge flights with airlines. The data.table way! setkey and merge!
data("airlines",package = "nycflights13" )
setDT(airlines)## year month day dep_time sched_dep_time dep_delay arr_time
## 1: NA NA NA NA NA NA NA
## 2: NA NA NA NA NA NA NA
## 3: NA NA NA NA NA NA NA
## sched_arr_time arr_delay carrier flight tailnum origin dest air_time
## 1: NA NA 9E NA NA NA NA NA
## 2: NA NA AA NA NA NA NA NA
## 3: NA NA AS NA NA NA NA NA
## distance hour minute time_hour not_delayed N dep_time_stamp
## 1: NA NA NA <NA> NA NA <NA>
## 2: NA NA NA <NA> NA NA <NA>
## 3: NA NA NA <NA> NA NA <NA>
## last_carrier name
## 1: NA Endeavor Air Inc.
## 2: NA American Airlines Inc.
## 3: NA Alaska Airlines Inc.
4.- Use the function IDateTime to create simultaniously two columns. First, create a vector with the two column names. Second DT[,():=IDateTime(...)] notation.
## year month day dep_time sched_dep_time dep_delay arr_time
## 1: 2013 1 1 810 810 0 1048
## 2: 2013 1 1 1452 1455 -3 1637
## 3: 2013 1 1 1451 1500 -9 1634
## 4: 2013 1 1 1454 1500 -6 1635
## 5: 2013 1 1 1507 1515 -8 1651
## ---
## 336772: 2013 12 27 1147 1150 -3 1342
## 336773: 2013 12 28 1540 1535 5 1708
## 336774: 2013 12 30 1143 1150 -7 1348
## 336775: 2013 12 31 1143 1150 -7 1341
## 336776: 2013 12 31 1430 1432 -2 1546
## sched_arr_time arr_delay carrier flight tailnum origin dest
## 1: 1037 11 9e 3538 n915xj jfk msp
## 2: 1639 -2 9e 3295 n920xj jfk buf
## 3: 1636 -2 9e 4105 n8444f jfk iad
## 4: 1636 -1 9e 3843 n8409n jfk syr
## 5: 1656 -5 9e 3792 n8631e jfk roc
## ---
## 336772: 1406 -24 yv 2885 n905fj lga clt
## 336773: 1700 8 yv 2889 n922fj lga phl
## 336774: 1406 -18 yv 2885 n918fj lga clt
## 336775: 1406 -25 yv 2885 n911fj lga clt
## 336776: 1555 -9 yv 3771 n515mj lga iad
## air_time distance hour minute time_hour not_delayed
## 1: 189 1029 8 10 2013-01-01 08:00:00 0
## 2: 68 301 14 55 2013-01-01 14:00:00 1
## 3: 57 228 15 0 2013-01-01 15:00:00 1
## 4: 57 209 15 0 2013-01-01 15:00:00 1
## 5: 66 264 15 15 2013-01-01 15:00:00 1
## ---
## 336772: 98 544 11 50 2013-12-27 11:00:00 1
## 336773: 32 96 15 35 2013-12-28 15:00:00 0
## 336774: 95 544 11 50 2013-12-30 11:00:00 1
## 336775: 103 544 11 50 2013-12-31 11:00:00 1
## 336776: 52 229 14 32 2013-12-31 14:00:00 1
## N dep_time_stamp last_carrier day_flight time_flight
## 1: 18460 2013-01-01 08:10:00 NA 2013-01-01 08:00:00
## 2: 18460 2013-01-01 14:55:00 b6 2013-01-01 14:00:00
## 3: 18460 2013-01-01 15:00:00 ev 2013-01-01 15:00:00
## 4: 18460 2013-01-01 15:00:00 b6 2013-01-01 15:00:00
## 5: 18460 2013-01-01 15:15:00 b6 2013-01-01 15:00:00
## ---
## 336772: 601 2013-12-27 11:50:00 us 2013-12-27 11:00:00
## 336773: 601 2013-12-28 15:35:00 us 2013-12-28 15:00:00
## 336774: 601 2013-12-30 11:50:00 us 2013-12-30 11:00:00
## 336775: 601 2013-12-31 11:50:00 us 2013-12-31 11:00:00
## 336776: 601 2013-12-31 14:32:00 ev 2013-12-31 14:00:00
Enjoy your brand new data.table knowledge!
From basics to advanced!
We expect that you meet new people!
We use to do beers afterwards!
Availables for further questions!